ABSTRACT

A review of the literature on multiple comparison 
procedures suggests several alternative approaches for comparing 
means when population variances differ. These include: (1) the 
approach of P. A. Games and J. F. Howell (1976); (2) C. W. Dunnett's 
C confidence interval (1980); and (3) Dunnett's T3 solution (1980). 
These procedures control the overall risk of a Type I error 
experimentwise at approximately the nominal significance level and 
have the best statistical power among alternative solutions. The 
two-stage multiple comparison procedures of Y. Hochberg and A. C. 
Tamhane (1987), and R. R. Wilcox (1987) are also discussed. These 
procedures and the Tukey-Kramer procedure were applied to data from a 
study of the effects of exercise on psychological and physiological 
variables with a total of 36 subjects. Textbooks and a sample of 
research studies were reviewed to determine the most frequently 
taught and used multiple comparison procedures. This review indicates 
that most applied researchers are not aware of the alternative 
solutions when variances differ. It is suggested that the 
Games-Howell procedure will provide a valid test for most purposes 
and should be included in statistical methods textbooks and classes. 
Four tables present data from the application and the reviews. 
(SLD) 
Multiple Comparison Procedures when Population Variances Differ 

Over the past several years I've had an ongoing discussion with 
several of my colleagues regarding teaching statistical methods classes. 
One of the topics that we've debated has been the role multiple 
comparison procedures play in our courses. It is well known that many 
procedures have been developed and that each has its advantages and 
disadvantages. From an instructor's point of view, choosing from among 
the many alternatives poses the problem of what to discuss in class. 
When asked, most of my colleagues say that they provide instruction on 
three or maybe four techniques. One friend feels strongly that only one 
procedure need be taught. Typically, the procedures mentioned have been 
Tukey's HSD, Bonferroni, Fisher's LSD and Scheffe. Many others have 
been mentioned, but these seem to me to be the most popular among those 
to whom I've spoken. The four procedures I've mentioned differ in 
several important ways, including type I error rates, statistical power, 
types of contrasts to be examined, and the handling of unequal sample 
sizes. But one characteristic that they all share is that they assume 
that the populations sampled have equal variances. On reflection, I 
don't believe that anyone I have spoken to has ever mentioned providing 
instruction on multiple comparison procedures when population variances 
differ. Several articles reviewing multiple comparison procedures have 
commented on the variance heterogeneity problem and have recommended 
approximate solutions. Statisticians certainly have been busy developing 
new and improved approaches to the problem, but I wondered whether these 
solutions are taught and whether the applied researcher is familiar with 
the issues and the solutions. 



The purpose of this paper is threefold. First, I wanted to briefly 
review the statistical literature on the alternative approaches for 
comparing means when population variances differ. Second, I wanted to 
demonstrate the approaches with a small data set. Third, I wanted 
to review textbooks and a sample of research studies to determine which 
multiple comparison procedures are most frequently taught (as 
indicated by textbook coverage) and how often applied researchers in 
education and psychology consider the procedures that allow variances to 
differ. My comments are limited to situations where all pairwise 
contrasts are of interest and control of type I errors is set 
experimentwise. 

Alternative Approximate Solutions 

When population variances differ, several solutions have been 
suggested. Many of the proposed procedures control the overall risk of 
type I errors but have low statistical power. Three procedures that 
have often been recommended are those developed by Games 
and Howell (1976), based on Welch's solution to the Behrens-Fisher 
problem; Dunnett's C (1980), based on Cochran's (1964) solution to the 
Behrens-Fisher problem; and Dunnett's T3 (1980), based on Sidak's (1967) 
uncorrelated-t inequality. These procedures control the overall risk of 
a type I error experimentwise at approximately the nominal significance 
level and have the best statistical power among the alternative 
solutions. 

The Games-Howell procedure constructs a confidence interval as 
follows: 
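
$$\bar{X}_j - \bar{X}_k \pm \frac{q_{J,\nu,1-\alpha}}{\sqrt{2}} \sqrt{\frac{s_j^2}{n_j} + \frac{s_k^2}{n_k}} \tag{1}$$

with degrees of freedom

$$\nu = \frac{\left(\frac{s_j^2}{n_j} + \frac{s_k^2}{n_k}\right)^2}{\frac{(s_j^2/n_j)^2}{n_j - 1} + \frac{(s_k^2/n_k)^2}{n_k - 1}} \tag{1b}$$

Where $\bar{X}_j$, $\bar{X}_k$ are the sample means for groups j and k respectively, 
$q_{J,\nu,1-\alpha}$ is the $1-\alpha$ centile of the studentized range 
distribution, 
J is the number of levels of the independent variable, 
$s_j^2$, $s_k^2$ are the unbiased estimates of the population variances 
for populations j and k respectively, and 
$n_j$, $n_k$ are the number of observations available from populations 
j and k respectively. 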
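
Dunnett's C confidence interval is constructed as: 

$$\bar{X}_j - \bar{X}_k \pm \frac{q_C}{\sqrt{2}} \sqrt{\frac{s_j^2}{n_j} + \frac{s_k^2}{n_k}} \tag{2}$$

Where 

$$q_C = \frac{q_{J,n_j-1,1-\alpha}\,\frac{s_j^2}{n_j} + q_{J,n_k-1,1-\alpha}\,\frac{s_k^2}{n_k}}{\frac{s_j^2}{n_j} + \frac{s_k^2}{n_k}} \tag{2b}$$

and the other terms are the same as those defined above. 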

Dunnett's T3 solution forms a confidence interval as: 

$$\bar{X}_j - \bar{X}_k \pm M_{J,\nu,1-\alpha} \sqrt{\frac{s_j^2}{n_j} + \frac{s_k^2}{n_k}} \tag{3}$$

Where M is the $1-\alpha$ centile of the Studentized Maximum-Modulus 
Distribution, and the other terms are as defined previously. 

All three procedures use the same estimate for the standard error 
and differ only in the identification of the critical values from their 
respective reference distributions. 
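
Because the three intervals share one standard error, the common pieces can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the paper: the function names `pairwise_se` and `welch_df` are hypothetical, and the inputs are the standard deviations and sample sizes reported for the treatment and control groups in Table 1.

```python
import math

def pairwise_se(s_j, n_j, s_k, n_k):
    """Standard error of a mean difference using separate variance estimates."""
    return math.sqrt(s_j**2 / n_j + s_k**2 / n_k)

def welch_df(s_j, n_j, s_k, n_k):
    """Welch degrees of freedom for the contrast between groups j and k."""
    v_j, v_k = s_j**2 / n_j, s_k**2 / n_k
    return (v_j + v_k) ** 2 / (v_j**2 / (n_j - 1) + v_k**2 / (n_k - 1))

# Treatment (T) and control (C) groups: (SD, n) taken from Table 1.
se_tc = pairwise_se(38.17, 10, 32.07, 5)
df_tc = welch_df(38.17, 10, 32.07, 5)

print(f"standard error = {se_tc:.3f}, Welch df = {df_tc:.2f}")
```

Each procedure then multiplies this standard error by its own critical value, which would be looked up in the appropriate table using the truncated degrees of freedom.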

Reviewers of the statistical literature differ in their 
recommendations regarding which procedure is "best". Keselman and Rogan 
(1977) and Jaccard, Becker and Wood (1984) both recommend that the 
Games-Howell procedure be used when variances differ. Games, 
Keselman and Rogan (1981) concluded that all three approaches are 
acceptable. Stoline (1981) recommends the T3 or C procedures and Wilcox 
(1987) concurs with Stoline. The disagreement stems from 
research findings indicating that the Games-Howell procedure 
can have an inflated experimentwise type I error rate under some 
conditions. Tamhane (1979) found inflated type I error rates for 
several situations but no clear pattern was identified. Dunnett (1980) 
concluded that the Games-Howell procedure may be slightly liberal when 
population variances are equal and becomes conservative as the variances 
differ. Games and Howell (1976) reported similar evidence but only 
studied a small number of situations where variances were equal. Wilcox 
(1987b) replicated Dunnett's conditions for a 4x4 factorial structure 
and examined differences in cell means. His results were consistent 
with Dunnett's in that the Games-Howell procedure was found to be liberal when 
variances were equal. He also found the Games-Howell procedure to be 
liberal when variances were unequal if cell sizes were small. All 
studies that considered the T3 and C procedures have consistently shown 
that these procedures are conservative. The C procedure is more conservative 
than T3 when sample sizes are small, but the opposite is true when sample 
sizes are large. With infinite degrees of freedom the C and 
Games-Howell procedures become identical (Dunnett, 1980). When 
statistical power has been considered, the Games-Howell procedure 
consistently provides narrower confidence limits. 

In all of the situations where the Games-Howell procedure has been 
found to be liberal, the sample sizes were small (n<15). When the 
smallest group had at least 15 observations, the type I error rate did 
not exceed the nominal significance level. 

Exact Solutions 

The procedures presented above are not exact tests. The actual 
confidence intervals only approximate the nominal confidence level. 
When population variances differ, no single-stage procedure can provide 
an exact solution. An alternative approach to multiple comparisons, 
which has not received much attention in the social science literature, 
is to estimate differences between population means using a two-step or 
two-stage sampling procedure. Basically, these procedures require the 
researcher to select samples from the populations of interest, estimate 
the population variances, and then, depending on the acceptable margin of 
error and variance inequality, sample a second time from each 
population. Two two-stage multiple comparison procedures are discussed 
by Hochberg and Tamhane (1987) and by Wilcox (1987). Both texts present 
the two-stage procedures developed by Tamhane (1977) and Hochberg 
(1975). 

Tamhane's two-stage procedure can be applied to any linear contrast, 
but only pairwise differences are of interest here. In stage one, random 
samples of individuals from each population are selected and basic 
descriptive statistics are computed (means, standard deviations). Next, 
the researcher determines the total number of observations needed for 
each group ($n_j$) so that the margin of error is no greater than m units. 
This is determined from the following: 

$$n_j = \max\left\{ n_0 + 1,\ \left[ s_j^2 / d \right]^* + 1 \right\} \tag{4}$$

Where $n_0$ is the initial sample size, 
$s_j^2$ is the sample variance of group j based on the initial sample 
of $n_0$, 

and $$d = (m/h)^2 \tag{4b}$$

Where m is the margin of error the researcher finds acceptable, 
h is the $1-\alpha$ centile point of J independent Student's t variates, 
based on $\nu = n_0 - 1$ degrees of freedom, and 
$[\ ]^*$ indicates the integer value. 

At a minimum, Tamhane's procedure requires at least one 
additional observation from each population. 

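A weighted mean for the two samples is then computed as: 

$$\tilde{X}_j = b_j \bar{X}_{2j} + (1 - b_j) \bar{X}_{1j} \tag{4c}$$

Where $\bar{X}_{1j}$, $\bar{X}_{2j}$ are the sample means from the first and second sampling 
stages for group j; 
$k = n_j - n_0$ is the number of additional observations made in stage two 
(here k is a count, not the group index used in the contrasts); 
and $b_j$ is computed as: 

$$b_j = \frac{1}{n_j} \left\{ 1 + \frac{1}{s_j} \left[ \frac{n_0 (n_j d - s_j^2)}{k} \right]^{1/2} \right\} \tag{4d}$$

A confidence interval is then constructed as the difference between the 
weighted means with the margin of error equalling m units: 

$$\tilde{X}_j - \tilde{X}_k \pm m.$$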

An alternative two-stage approach suggested by Hochberg (1975) 
determines the number of observations needed, and hence any additional 
observations required in the second stage, based on the following: 

$$n_j = \max\left\{ n_0,\ \left[ s_j^2 / d \right]^* + 1 \right\} \tag{5}$$

The terms are the same as those defined by Tamhane above. The 
confidence interval is provided by: 

$$\bar{X}_j - \bar{X}_k \pm h \max\left[ \frac{s_j}{\sqrt{n_j}},\ \frac{s_k}{\sqrt{n_k}} \right] \tag{5b}$$

The difference between means is estimated using all of the data from 
samples at both stage 1 and 2. The variance estimate, however, is based 
on data from the first sampling stage. 

An advantage of Hochberg's procedure is that additional 
observations may not be needed, but a possible disadvantage is that the 
width of the confidence intervals can vary. 
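
The two sample-size rules differ only in their lower bound, which a short Python sketch makes concrete. The function names `tamhane_n` and `hochberg_n` are hypothetical, and the margin of error (65), the critical value h (4.40), and the first-stage standard deviations and sample sizes are assumed from the worked example and Table 1 of this paper.

```python
import math

def tamhane_n(n0, s, d):
    """Total sample size under Tamhane's rule: always at least n0 + 1."""
    return max(n0 + 1, math.floor(s**2 / d) + 1)

def hochberg_n(n0, s, d):
    """Total sample size under Hochberg's rule: may stay at n0."""
    return max(n0, math.floor(s**2 / d) + 1)

m, h = 65.0, 4.40      # margin of error and critical value (from the example)
d = (m / h) ** 2       # squared ratio of margin of error to critical value

# First-stage (SD, n) for each group, from Table 1.
groups = {"T": (38.17, 10), "C": (32.07, 5), "J": (41.19, 11), "S": (63.53, 10)}

# Additional observations required: (Tamhane, Hochberg) for each group.
extra = {
    g: (tamhane_n(n0, s, d) - n0, hochberg_n(n0, s, d) - n0)
    for g, (s, n0) in groups.items()
}

print(f"d = {d:.2f}")
for g, (tam, hoch) in extra.items():
    print(f"{g}: Tamhane +{tam}, Hochberg +{hoch}")
```

Under these assumptions only the sedentary group, with its much larger standard deviation, requires second-stage sampling under Hochberg's rule, while Tamhane's rule always adds at least one observation per group.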

Example Problem 

Since many applied researchers may not be familiar with these 
alternative procedures for contrasts when variances differ, I thought it 
would be useful to demonstrate an application to a real data set. The 
data example I chose for the demonstration is taken from Moore and 
McCabe's (1989) new statistics textbook titled Introduction to the 
Practice of Statistics. One of the exercises in this text cites a 
dissertation study by Lobstein (1983). In this investigation the 
effects of exercise on psychological and physiological variables were 
studied. Four groups were considered: a treatment group (T) who 
participated in an exercise program; a control group (C) who had 
volunteered to participate in the exercise program but for various 
reasons could not attend the treatment sessions; a group of joggers (J); 
and a group of sedentary people (S) who did not exercise regularly. 
One of the outcome measures used in the study was a physical fitness 
scale administered when the treatment was terminated. Descriptive 
statistics on the four groups are reported in Table 1. 



Insert Table 1 about here 

For this demonstration I will focus on the contrast between the 
treatment (T) and control (C) groups. 

First, consider the Games-Howell procedure. Using (1b) the degrees 
of freedom are: 

$$\nu = \frac{\left(\frac{38.17^2}{10} + \frac{32.07^2}{5}\right)^2}{\frac{(38.17^2/10)^2}{10-1} + \frac{(32.07^2/5)^2}{5-1}} = 9.5$$

And the critical value from the Studentized range distribution is found 
using the truncated value of $\nu$ as: 

$$q_{4,9,.95} = 4.42$$

The .95 confidence interval from (1) would be estimated as: 

$$291.91 - 308.97 \pm \frac{4.42}{\sqrt{2}} \sqrt{\frac{38.17^2}{10} + \frac{32.07^2}{5}}$$

$$-17.06 \pm 58.58$$
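The critical value for Dunnett's C procedure found from (2b) is 
determined from: 

$$q_{4,9,.95} = 4.42 \quad \text{and} \quad q_{4,4,.95} = 5.76$$

as 

$$\frac{4.42\,\frac{38.17^2}{10} + 5.76\,\frac{32.07^2}{5}}{\frac{38.17^2}{10} + \frac{32.07^2}{5}} = 5.204$$

Using (2) the .95 confidence interval is estimated as: 

$$291.91 - 308.97 \pm \frac{5.204}{\sqrt{2}} \sqrt{\frac{38.17^2}{10} + \frac{32.07^2}{5}}$$

$$-17.06 \pm 68.99$$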

For Dunnett's T3 procedure the critical value from the maximum-modulus 
distribution is found as: 

$$M_{4,9,.95} = 3.27$$

And the .95 confidence interval from (3) is found as: 

$$291.91 - 308.97 \pm 3.27 \sqrt{\frac{38.17^2}{10} + \frac{32.07^2}{5}}$$

$$-17.06 \pm 61.30$$
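
The arithmetic behind the three approximate intervals for the T versus C contrast can be checked with a short Python sketch. The critical values are taken as given rather than computed from their reference distributions: 4.42 and 5.76 are the studentized range values quoted in the text, while the maximum-modulus value 3.27 is an assumption inferred from the T3 half-width of 61.30 reported in Table 2. Variable names are illustrative only.

```python
import math

# Table 1 values for the treatment (T) and control (C) groups.
s_t, n_t = 38.17, 10
s_c, n_c = 32.07, 5

v_t, v_c = s_t**2 / n_t, s_c**2 / n_c   # estimated variance of each mean
se = math.sqrt(v_t + v_c)               # standard error shared by all three

q_9, q_4 = 4.42, 5.76   # studentized range values for 9 and 4 df (from the text)
m_smm = 3.27            # maximum-modulus value (assumed, see note above)

gh_half = q_9 / math.sqrt(2) * se                # Games-Howell half-width
q_c = (q_9 * v_t + q_4 * v_c) / (v_t + v_c)      # Dunnett C weighted critical value
c_half = q_c / math.sqrt(2) * se                 # Dunnett C half-width
t3_half = m_smm * se                             # Dunnett T3 half-width

diff = 291.91 - 308.97
for name, half in [("GH", gh_half), ("C", c_half), ("T3", t3_half)]:
    print(f"{name}: {diff:.2f} +/- {half:.2f}")
```

Differences of about .01 from the reported half-widths are to be expected, since the critical values are used here at two-decimal precision.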

For the two-stage procedures, let's assume that the data in Table 1 
report sample data from the first sampling stage, and then determine the 
number of additional observations needed. The critical value h from the 
distribution of independent Student t-variates has degrees of freedom 
equal to the integer value of $J / \sum_j n_j^{-1}$ (Wilcox, 1987): 

$$\nu = 4/(.4909) = 8$$

$$h_{4,8,.95} = 4.40$$

If we assume that a margin of error equalling 65 is acceptable, then 
using (4b) and (4) the sample sizes needed for the Treatment (T) and 
Control (C) groups are determined as: 

$$d = (65/4.40)^2 = 218.23$$

$$n_T = \max\left[ 10+1,\ \left[ (38.17)^2/218.23 \right]^* + 1 \right] = 11$$

$$n_C = \max\left[ 5+1,\ \left[ (32.07)^2/218.23 \right]^* + 1 \right] = 6$$

Thus for the contrast between the Treatment and Control groups one 
additional individual (k=1) would be needed from each group. 

The weighting factor from (4d) for the Treatment and Control groups 
would be determined as: 

$$b_T = \frac{1}{11} \left\{ 1 + \frac{1}{38.17} \left[ \frac{10(11 \times 218.23 - 38.17^2)}{11-10} \right]^{1/2} \right\} = .3223;$$

$$b_C = \frac{1}{6} \left\{ 1 + \frac{1}{32.07} \left[ \frac{5(6 \times 218.23 - 32.07^2)}{6-5} \right]^{1/2} \right\} = .3614.$$

Finally, the weighted group means are calculated as: 

$$\tilde{X}_T = .3223\,\bar{X}_{2T} + .6777\,\bar{X}_{1T}$$

$$\tilde{X}_C = .3614\,\bar{X}_{2C} + .6386\,\bar{X}_{1C}$$

The confidence interval is estimated as: 

$$\tilde{X}_T - \tilde{X}_C \pm 65.0.$$

For Hochberg's procedure, using (5) with the same value for d as was 
determined above: 

$$n_T = \max\left[ 10,\ \left[ (38.17)^2/218.23 \right]^* + 1 \right] = 10$$

$$n_C = \max\left[ 5,\ \left[ (32.07)^2/218.23 \right]^* + 1 \right] = 5$$

In the present example additional observations would not be needed from 
either the T or C groups. 

The confidence interval from (5b) is found as: 

$$291.91 - 308.97 \pm 4.40(14.34)$$

$$-17.06 \pm 63.11$$
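
The two-stage quantities for the T versus C contrast can be reproduced with a brief Python sketch. The function name `tamhane_b` is hypothetical, and the inputs (h = 4.40, a margin of error of 65, the Table 1 standard deviations, and the total sample sizes of 11 and 6 under Tamhane's rule) are taken from the worked example.

```python
import math

def tamhane_b(s, n0, n, d):
    """Second-stage weight for one group; n - n0 is the number of added cases."""
    k = n - n0
    return (1 + math.sqrt(n0 * (n * d - s**2) / k) / s) / n

h, m = 4.40, 65.0
d = (m / h) ** 2

b_t = tamhane_b(38.17, 10, 11, d)   # treatment group
b_c = tamhane_b(32.07, 5, 6, d)     # control group

# Hochberg half-width: h times the larger standard error of a group mean.
hoch_half = h * max(38.17 / math.sqrt(10), 32.07 / math.sqrt(5))

print(f"b_T = {b_t:.4f}, b_C = {b_c:.4f}, Hochberg half-width = {hoch_half:.2f}")
```

With these inputs the weights agree with the values .3223 and .3614 given in the text, and the Hochberg half-width agrees with the T - C entry for Hochberg's procedure in Table 2.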

Table 2 presents the results when these five procedures are used for 
the six possible pairwise contrasts for the problem. The Tukey-Kramer 
procedure was added for comparative purposes. I chose the Tukey-Kramer 
procedure since it is fairly well known and is generally recommended for 
situations where sample sizes are unequal, the error rate is set 
experimentwise, and it can be assumed that population variances are 
equal. 

Insert Table 2 about here 

For Tamhane's procedure one additional observation was needed 
for the T, C, and J groups and 9 additional observations would be needed 
from the S group. With Hochberg's procedure no additional observations 
would be needed for the T, C, and J groups but 9 additional observations 
would be needed from the S group. The point estimates for the contrasts 
between groups would likely be different with Tamhane's procedure, but 
only contrasts involving the S group would have point estimates that 
would change with Hochberg's procedure. 

Statistics Textbooks 

For the third portion of this paper we were interested in examining 
textbooks that might be used for statistical methods classes in the 
behavioral sciences. To identify possible texts we went to the 
libraries at the University of Georgia and obtained a listing of all 
books that were listed under the key words: behavioral statistics, 
statistics, and social sciences-statistical methods. A list of 161 
titles was printed. We then went through this list and identified 
only those books that might be used as a textbook and were published from 
1980 to the present. Textbooks on topics such as factor analysis, 
multivariate analysis, regression analysis, and sampling were excluded from 
consideration. Finally, after examining these books, we included only 
those texts that discussed analysis of variance. For our analysis we 
examined 48 texts. From this list, 9 (19%) of the texts had no 
presentation on multiple comparisons or contrast analysis. Table 3 
summarizes the frequency with which the most popular multiple comparison 
procedures were presented. The percentages 
reported in Table 3 are based on the number of books that discuss 
multiple comparison procedures (39). 

Insert Table 3 about here 

By a considerable margin the Tukey 
HSD and Scheffe's multiple comparison procedures were the most 
popular techniques taught. Only 6 (15%) of the textbooks we examined 
discussed the issue of variance heterogeneity in the context of multiple 
comparisons. All six of these texts discussed the Games-Howell 
procedure but only two of them commented on Dunnett's T3 procedure. 

Journal Articles 

Finally, we were interested in examining empirical research studies 
that have tested hypotheses on the equality of means and have examined 
pairwise contrasts. For our review we examined five research journals: 
American Educational Research Journal, Journal of Educational 
Psychology, Reading Research Quarterly, Journal of Experimental 
Education, and Journal of Educational Research. We limited our review 
to the four-year period between 1985 and 1988. Table 4 summarizes what 
we found. By far the most popular multiple comparison procedure used 
was the Newman-Keuls. This finding is consistent with the results 
reported by Jaccard, Becker and Wood (1984), who also found the 
Newman-Keuls procedure the most popular technique in a survey of 
articles published by the American Psychological Association in 1982. 
Contrary to what Jaccard et al. found, our survey indicated that Tukey's 
HSD procedure was the second most popular procedure. In reviewing the 
five research journals we did not find a single article that used a 
technique that did not assume equal variances. Perhaps this is not 
surprising since so few texts discuss the alternative procedures and 
those that do have only recently been published. It is possible, of 
course, that in applied research studies the assumption of equal 
variances is generally met, so the techniques I've been discussing are 
rarely needed. We thought we might be able to address this possibility by 
examining the descriptive statistics reported in the research articles. 
Unfortunately, we found that many studies do not report indices of 
spread. Of those studies that did report the sample standard 
deviations, many appeared to have variances that were homogeneous, but 
several had variances that differed by more than a factor of 2. 
Unequal sample sizes were common in the studies we examined. 

Conclusions 

Based on what I have read and learned about the issue of contrast 
analysis with heterogeneous variances, I have come to the following two 
conclusions. First, I am pretty much convinced that most applied 
researchers are unaware of the problem and probably are unaware of the 
alternative solutions when variances differ. Second, I think that for 
most research studies the Games-Howell procedure will provide a valid 
test and should be included in statistical methods textbooks and 
classes. 

Table 1 

Means, standard deviations, and sample sizes for four groups 
investigating the effect of exercise on fitness. 

                          T         C         J         S
Mean                   291.91    308.97    366.87    226.07
Standard Deviation      38.17     32.07     41.19     63.53
Sample Size                10         5        11        10

T=Treatment, C=Control, J=Joggers, S=Sedentary. 

Table 2 

Summary of half-widths for confidence intervals based on alternative 
procedures. 

Contrast   Point        TK      GH       C      T3     Tam       H
           Estimate
T - C       -17.06    69.73   58.58   68.99   61.30   65.00   63.11
T - J       -74.96    55.63   48.97   53.56   50.74   65.00   53.65
T - S        65.89    56.94   68.11   73.25   70.78   65.00   88.40
C - J       -57.90    68.67   59.29   69.05   62.04   65.00   63.11
C - S        82.90    69.73   73.14   85.04   76.27   65.00   88.40
J - S       140.80    55.63   68.14   73.40   70.62   65.00   88.40

T=Treatment, C=Control, J=Joggers, S=Sedentary. 

TK=Tukey-Kramer, GH=Games-Howell, C=Dunnett C, T3=Dunnett T3, 
Tam=Tamhane, H=Hochberg. 

Table 3 

Frequency with which the most popular multiple comparison procedures 
were discussed. 

Multiple Comparison
Procedure             Frequency   Percent
Tukey                     27         69
Scheffe                   24         62
Fisher LSD                11         28
Newman-Keuls              10         26
Bonferroni                 8         21
Dunnett                    6         15
Games and Howell           6         15
Duncan                               10
Dunnett T3                 2          5
Dunnett C                  2          5

Table 4 

Frequency counts of the most popular multiple comparison procedures 
found in journal articles 1985-1988. 

                               Procedure
Journal   Tukey   Newman-   Dunn-         Scheffe   Duncan   Fisher
          HSD     Keuls     Bonferroni                       LSD
AERJ        3        6          2            1
JEP        18       26          7            9         1        3
RRQ         6       11          1            2         1
JEE         4        3                       1
JER         6        8          2            6                  1
Total      37       54         12           19         2        4

AERJ=American Educational Research Journal, JEP=Journal of Educational 
Psychology, RRQ=Reading Research Quarterly, JEE=Journal of 
Experimental Education, JER=Journal of Educational Research. 